Waybackup v2.1b HOWTO
TTK Ciar, ttk@archive.org
HOWTO document, 2004-02-07
 
 
   This document discusses installing and using waybackup-v2.1b on Linux (or most other UNIXy systems).
 
 
Table of Contents
 
   1. Introduction
   2. Installation
   3. Common Usage
   4. Detailed Synopsis of Use
      4.1 Scrubbing Filenames (-e)
      4.2 Scrubbing File Contents (-S)
      4.3 Waybackupping All Versions (-T)
      4.4 Timestamps and Where They Go (-t)
      4.5 Configuration Files
   5. Limitations of v2.1b
   6. Authors and History
   7. Running Waybackup
      7.1 What You Will See, and What It Means
      7.2 What You Will Get
   8. Misc Notes
   9. Things To Do
 
 
1. INTRODUCTION
 
  Waybackup version 2.1b replaces the existing (functional, but brittle and low-performing) tool at homeserver:~waybackup/waybackup.pl.
 
  The object of waybackup is to duplicate the filesystem of one or more webservers from the Internet Archive's most recent archived files.  It treats dynamically generated output as a simple file.  Its output is a filesystem hierarchy and a file bundling that hierarchy (ie, a tarball, tar.gz, or zip file).
 
 
2. INSTALLATION
 
  Waybackup requires very little setup.  If all necessary modules have been installed (which they should already be, on all Internet Archive systems of note), all that is needed is the "waybackup" perl script.  Its official place to live is on homeserver, in the "waybackup" user home directory.
 
  The "waybackup" perl script is more or less self-contained, but depends on a few perl modules:
 
  LWP::Simple
  This is a simple http interface module.  It is commonly installed on all systems at the Internet Archive.  If for some reason it is not installed on the system you are using, get the module from "http://www.cpan.org".
 
  File::Path
  File::Find
  Time::Local
  These are all fairly standard modules which should be installed on just about any server which has perl installed.  Again, if you do not have them, they are available at "http://www.cpan.org".
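 
  A quick way to confirm that these modules are present is to ask perl to load each one.  The helper below is only a sketch: the name "missing_modules" is made up for this HOWTO and is not part of waybackup.  It prints the name of every module that perl fails to load, and prints nothing if all of them are available:

```shell
# Hypothetical helper, not shipped with waybackup: print each named perl
# module that cannot be loaded by the perl found on the PATH.
missing_modules() {
    for mod in "$@"; do
        # "perl -MName -e 1" exits non-zero if the module cannot be loaded
        # (or if perl itself is not installed).
        perl -M"$mod" -e 1 >/dev/null 2>&1 || printf '%s\n' "$mod"
    done
}

# The four modules listed above:
missing_modules LWP::Simple File::Path File::Find Time::Local
```

  Any name it prints is a module to fetch from CPAN before running waybackup.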
 
  There are also "INSTALL" and "README" files, which contain approximately the same information as this HOWTO you are reading now.
 
 
3. COMMON USAGE
 
  To create a directory "site.foo_org" with all of foo.org's files in it, and a file called "site.foo_org.zip" with the compressed contents of "site.foo_org" in it, type:
 
   % waybackup foo.org
  If a file called "site.foo_org" already exists, a number will be appended to the end of its name (eg, "site.foo_org.01" and "site.foo_org.01.zip").
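 
  The naming rule can be sketched in a few lines of shell.  This is an illustration inferred from the examples in this HOWTO, not waybackup's actual code; the function names and the two-digit suffix format are assumptions:

```shell
# Hypothetical sketch of the naming rule: "foo.org" -> "site.foo_org".
site_dir_name() {
    printf 'site.%s\n' "$(printf '%s' "$1" | tr '.' '_')"
}

# Print the first unused name in directory $2 (default: current directory),
# trying site.foo_org, then site.foo_org.01, site.foo_org.02, and so on.
next_free_dir() {
    base=$(site_dir_name "$1")
    parent=${2:-.}
    candidate=$base
    n=0
    while [ -e "$parent/$candidate" ]; do
        n=$((n + 1))
        candidate=$(printf '%s.%02d' "$base" "$n")
    done
    printf '%s\n' "$candidate"
}

next_free_dir foo.org
```

  The bundle file takes the same name plus its extension, so "site.foo_org.01" goes with "site.foo_org.01.zip".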
 
  If waybackup detects that the user running it is the "waybackup" user, then it will also attempt to copy (via scp) the tarball to the waybackup directory on ftp.archive.org.
 
  To make waybackup display more verbose messages, add a "-v" option, eg:
 
   % waybackup -v foo.org
  To make waybackup display nothing (except error messages, if any), add a "-q" option, eg:
 
   % waybackup -q foo.org
  NOTE: waybackup will create the "site.foo_org" directory in the directory
  "/0/tmp/waybackup/" if the user has write permissions in that directory,
  otherwise it will create "site.foo_org" in the CURRENT WORKING DIRECTORY
  of the invoking user.
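 
  To find out in advance where the output will land, the same test can be made by hand.  The function below is a sketch of the rule described in the note; "output_parent" is a made-up name, and waybackup's own check may differ in detail:

```shell
# Hypothetical sketch: print the directory in which "site.foo_org" would be
# created.  The preferred directory is used only if it exists and is
# writable by the current user; otherwise the current working directory.
output_parent() {
    preferred=${1:-/0/tmp/waybackup}
    if [ -d "$preferred" ] && [ -w "$preferred" ]; then
        printf '%s\n' "$preferred"
    else
        pwd
    fi
}

output_parent
```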
 
 
4. DETAILED SYNOPSIS
 
  waybackup [-h --help] [-d] [-t <timestamp>] [-q] [-v] url [timestamp]

  --help    show this help blurb + exit
  -h        same as --help
  -d        turn on debugging messages (outputs to /tmp/waybackup-dbg)
  -q        quiet output
  -e        "scrub" filenames to avoid special characters
  -S        refrain from "scrubbing" file contents of Archive-specific data
  -t        find newest files older than <timestamp>
  -T        generate a subdirectory for each timestamped version of the site
  -v        verbose output
  -a        bundle the created filesystem hierarchy into a .tar file
  -z        bundle the created filesystem hierarchy into a .tar.gz file
  -zz       bundle the created filesystem hierarchy into a .zip file (default)
  url       rightmost part of site(s) of interest, eg yahoo.com or .co.uk
  tstamp    same as '-t <timestamp>', for backwards-compatibility (NOT SUPPORTED)
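 
  The three bundling options correspond to familiar archive formats.  As a rough illustration, an equivalent bundle can be made by hand from an existing waybackup directory.  Waybackup's own bundling code is not shown in this HOWTO, so the function name and the exact tar and zip invocations below are assumptions:

```shell
# Hypothetical sketch: bundle directory $1 the way the -a, -z and -zz
# options are described above.  With no second argument it behaves like
# -zz, which is the default.
bundle_dir() {
    dir=$1
    case "${2:--zz}" in
        -a)  tar -cf "$dir.tar" "$dir" ;;       # plain tarball
        -z)  tar -czf "$dir.tar.gz" "$dir" ;;   # gzip-compressed tarball
        -zz) zip -qr "$dir.zip" "$dir" ;;       # zip file
        *)   echo "unknown bundle option: $2" >&2; return 2 ;;
    esac
}

# Example (run from the directory that contains site.foo_org):
#   bundle_dir site.foo_org -z
```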
 
4.1 Scrubbing Filenames
 
    When waybackup creates files with funky names, Windows sometimes has a problem with them.  This can prevent users from unzipping a waybackup onto their Windows machine.  To avoid this problem, the '-e' option was created, so that filenames are 'scrubbed' before the files are created.  When making a waybackup for Windows users, specify the '-e' option.  (Really, it wouldn't hurt to just always use this option -- if the user doesn't like it, they can squawk, and then we can make another waybackup for them without the scrubbing.)
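 
    This HOWTO does not list exactly which characters '-e' replaces, or what it replaces them with.  To give a feel for the idea, the sketch below swaps the characters that Windows refuses in file names (< > : " \ | ? *) for underscores.  The function name and the choice of underscore are assumptions, not waybackup's actual rule:

```shell
# Hypothetical sketch of filename scrubbing: replace characters that
# Windows does not allow in a file name with "_".  The "/" is left alone
# because it separates directories in the path.
scrub_name() {
    printf '%s\n' "$1" | sed 's/[<>:"\\|?*]/_/g'
}

scrub_name './www.foo.org/?M=A'
```

    A name such as "?M=A", which is legal on UNIXy systems, would come out as "_M=A".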
 
4.2 Scrubbing File Contents
 
    When files are retrieved from the Wayback Machine, these files contain additional information intended to make it easy for the user's browser to interact with the Wayback Machine, making it possible to follow links when links are clicked on.  Unfortunately, when a user wants to restore their web site's document root from a waybackup, this additional information gets in the way.  By default, all documents are 'scrubbed' of this information (ie, it is removed).  If for some reason you do not want to scrub documents, use the '-S' option.
 
4.3 Waybackupping All Versions
 
    Normally, waybackup will create one directory which contains all of the site's files and is as complete as possible.  Some users might need multiple copies of their site waybackupped.  To facilitate this, the '-T' option may be used.  This will create one subdirectory within the site subdirectory for each version of the site which is in the archive.  When a timestamp is given, only those versions as old as that timestamp or older will be waybackupped.
 
4.4 Timestamps and -t
 
    There are two ways to invoke waybackup with a timestamp.  This has sown confusion among some users.  The timestamp may be put at the end of the command line, such as in:
  % waybackup -e -v tanis.com 200301020304
    Alternatively, it may be placed after a '-t', with the '-t' appearing anywhere in the command line.  These are all identical to the command line shown above:
  % waybackup -t 200301020304 -e -v tanis.com
  % waybackup -e -t 200301020304 -v tanis.com
  % waybackup -e -v -t 200301020304 tanis.com
    Do not do this:
  % waybackup -t -e 200301020304 -v tanis.com 
    The timestamp must follow immediately after the -t!  If it does not, waybackup will get horribly confused.
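 
    When waybackup is driven from another script, this mistake can be caught before the command is run.  The check below is a sketch only ("check_t_option" is a made-up name); it assumes a timestamp consists solely of digits, as in the examples above:

```shell
# Hypothetical pre-flight check: succeed (status 0) if every "-t" in the
# argument list is immediately followed by an all-digit timestamp.
check_t_option() {
    while [ $# -gt 0 ]; do
        if [ "$1" = "-t" ]; then
            case "${2:-}" in
                ''|*[!0-9]*) return 1 ;;   # missing, or not purely digits
            esac
            shift                          # skip the timestamp as well
        fi
        shift
    done
    return 0
}

check_t_option -e -t 200301020304 -v tanis.com && echo "ok"
check_t_option -t -e 200301020304 -v tanis.com || echo "bad: -t is not followed by a timestamp"
```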
 
4.5 Configuration Files
 
  Waybackup uses no configuration files.
 
 
5. IMPORTANT NOTE - LIMITATIONS OVER PREVIOUS VERSION
 
  Waybackup v2.1b uses the Wayback Machine's http-based front-end interface
  to construct its list of available files and retrieve them.  As such, it
  has a few disadvantages over the previous version of waybackup (which
  accessed the back-end directly):
    * Sites with more than 10,000 files will only have the first 10,000
      files retrieved (eg, yahoo.com).
    * It is slower than the previous version.
 
  This was done so that waybackup will continue to work, even as the data
  repository team is evolving the archive's back-end implementation.
 
  As the front-end API is improved, these problems will be rectified.
 
 
6. AUTHORS
 
  Bruce Baumgart and Matt Lee wrote the original front-end-using waybackup, which was never deployed internally at the Internet Archive.
 
  wrote the original back-end-using waybackup.pl, which was deployed internally at the Internet Archive.  It is deprecated, and will soon stop functioning, as the back-end implementation is due to change.
 
  TTK Ciar rewrote Bruce and Matt's waybackup into waybackup-v2.0, making it hypothetically suitable for internal use at the Internet Archive.  It is between 6x and 10x faster than Bruce and Matt's original code.
 
  contact: ttk@archive.org
 
 
7. RUNNING WAYBACKUP
 
7.1 WHAT YOU WILL SEE, AND WHAT IT MEANS
 
  When waybackup is run in "quiet" mode, and it encounters no errors, its output is pretty boring:
 
   % waybackup -q foo.org
   %
  When waybackup is run in "verbose" mode, something like this is displayed:
 
    % waybackup -v foo.org
    found host 'www.foo.org' as of '20020524180602'
    found host 'foo.org' as of '20020605202726'
    creating waybackup directory 'site.foo_org.01'
    pulling data from virtualhost 'www.foo.org' [1 of 2]
    0% .................................................. 100%
       ++++++++++++++++++++++++++++++++++++++++++++++++++ done in 61 seconds
    pulling data from virtualhost 'foo.org' [2 of 2]
    0% .................................................. 100%
       ++++++++++++++++++++++++++++++++++++++++++++++++++ done in 20 seconds
    bundling into site.foo_org.01.zip
    finished! total elapsed time 83 seconds
    365 files stored at rate of 4.5 files per second
    %
  Now, what this means:
 
    found host 'www.foo.org' as of '20020524180602'
    found host 'foo.org' as of '20020605202726'
  After an initial delay (of perhaps several seconds), waybackup will find all of the "virtual hosts" associated with the URL specified by the user. In this case, the Archive has the contents of "foo.org" stored under two virtual hosts: www.foo.org and foo.org.  The latest versions of these are dated 2002-05-24 18:06:02, and 2002-06-05 20:27:26.
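 
  The timestamps are fourteen digits in the order year, month, day, hour, minute, second.  If you need to read a lot of them, a one-line filter does the conversion.  The function name below is made up for this HOWTO:

```shell
# Hypothetical helper: turn a 14-digit timestamp (YYYYMMDDhhmmss) into
# "YYYY-MM-DD hh:mm:ss".  Input of any other length is printed unchanged.
decode_timestamp() {
    printf '%s\n' "$1" |
        sed 's/^\(....\)\(..\)\(..\)\(..\)\(..\)\(..\)$/\1-\2-\3 \4:\5:\6/'
}

decode_timestamp 20020524180602
decode_timestamp 20020605202726
```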
 
    creating waybackup directory 'site.foo_org.01'
  This means that the directory into which the files are being restored is called "site.foo_org.01" (since there was already a directory called site.foo_org -- successive waybackups of this site would be suffixed with a .02, .03, etc, so that older backups are not overwritten).
 
    pulling data from virtualhost 'www.foo.org' [1 of 2]
    0% .................................................. 100%
  Waybackup will pull data from each virtualhost in turn.  It gets a list of files to be retrieved, and then spawns 50 "child processes" (copies of itself which all run at the same time).  These child processes divvy up the file list amongst themselves, and each process pulls down 1/50th of the total file list.
 
    0% .................................................. 100%
       ++++++++++++++++++++++++++++++++++++++++++++++++++ done in 61 seconds
  As each child completes its part of the list, a "+" will appear under the "."'s, giving the user a real-time view of how the work is going. 
 
  When there are a lot of files to be transferred, it will take a long time for the first "+"'s to appear.  After a few appear, the others will follow very quickly (since most of the child processes will finish their lists around the same time).  The last few "+"'s might take longer to appear, as some processes will have to deal with larger files than others.  When the last "+" appears, the total time elapsed appears to the right.
 
  When there are only a few files to be transferred, a slew of "+"'s will appear immediately (as some processes will have only 1 file to transfer, or none at all), and the rest will trickle in soon thereafter.
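 
  The way the file list is shared out explains both cases.  This HOWTO does not say how the children divide the list, so the sketch below assumes a simple round-robin split (file 1 to child 0, file 2 to child 1, and so on); the function name is made up:

```shell
# Hypothetical sketch: read a file list on stdin and print only the lines
# belonging to child number $2 (counting from 0) out of $1 children,
# assuming the list is dealt out round-robin.
worker_share() {
    awk -v n="$1" -v k="$2" '(NR - 1) % n == k'
}

# With 3 files and 50 children, child 0 gets one file and child 7 gets none.
printf '%s\n' /index.html /a.html /b.html | worker_share 50 0
printf '%s\n' /index.html /a.html /b.html | worker_share 50 7
```

  Under this assumption a three-file site leaves 47 of the 50 children with nothing to do, which is why their "+"'s show up at once.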
 
    pulling data from virtualhost 'foo.org' [2 of 2]
    0% .................................................. 100%
       ++++++++++++++++++++++++++++++++++++++++++++++++++ done in 20 seconds
  This is repeated for each virtual host.
 
    bundling into site.foo_org.01.zip
  When the last virtual host is completed, waybackup bundles the destination directory (site.foo_org.01) into a single file.
 
    finished! total elapsed time 83 seconds
    365 files stored at rate of 4.5 files per second
  At the end, you get a summary of the tool's performance.
 
  Re-running this without the -v option will give a slightly sparser output:
 
    % waybackup foo.org
    creating waybackup directory 'site.foo_org.02'
    pulling data from virtualhost 'www.foo.org' [1 of 2]
    0% .................................................. 100%
       ++++++++++++++++++++++++++++++++++++++++++++++++++
    pulling data from virtualhost 'foo.org' [2 of 2]
    0% .................................................. 100%
       ++++++++++++++++++++++++++++++++++++++++++++++++++
    finished! total elapsed time 79 seconds
 
7.2 WHAT YOU WILL GET
 
  Running "waybackup foo.org" would generate a directory, "site.foo_org".
  Inside that directory are the following files and directories:
     AUDIT .......... a log of files which, for whatever reason, did not make
                      it into this backup.  This file might not be present.
     MANIFEST ....... a list of the files which did make it into this backup.
     README ......... a short friendly informative blurb for the recipient.
     foo.org/ ....... a directory containing the files from the first virtual
                      host.  One will be created for each virtual host.
     www.foo.org/ ... a directory for the files of the next virtual host.
 
  Also, one level up from the "site.foo_org" would be "site.foo_org.zip", a compressed file containing all of the files from "site.foo_org", eg:
 
    % ls -l
    drwxr-xr-x    4 ttk      users        4096 Aug 22  2001 site.foo_org
    -rw-r--r--    1 ttk      users     4608080 Dec 11 14:38 site.foo_org.zip
    % ls -l site.foo_org
    -rw-r--r--    1 ttk      users       15738 Aug 22  2001 MANIFEST
    -rw-r--r--    1 ttk      users         574 Aug 22  2001 README
    drwxr-xr-x    9 ttk      users        4096 Aug 22  2001 foo.org
    drwxr-xr-x    8 ttk      users        4096 Aug 22  2001 www.foo.org
    % head -5 site.foo_org/MANIFEST
    4       ./README
    4       ./www.foo.org/?M=A
    4       ./www.foo.org/brian/suck.html
    8       ./www.foo.org/brian/disclaim.html
    4       ./www.foo.org/brian/me.html 
  The MANIFEST is just the output of "du -a", so each line gives the size of a file in blocks (usually 1024 bytes each), followed by the file's pathname relative to the backup directory.
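 
  Because the format is so simple, the MANIFEST is easy to summarize with standard tools.  The function below (a made-up name, not part of waybackup) counts the entries and adds up the first column.  Note that "du -a" lists directories as well as files, so the count is of entries rather than strictly of files:

```shell
# Hypothetical helper: summarize a MANIFEST given as a file argument or on
# stdin.  Only the first column (size in blocks) is used, so pathnames
# containing spaces do not matter.
manifest_summary() {
    awk '{ blocks += $1; entries++ }
         END { printf "%d entries, %d blocks\n", entries, blocks }' "$@"
}

# Example:
#   manifest_summary site.foo_org/MANIFEST
```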
 
 
8. MISC NOTES
 
  * The child processes rename themselves to "waybackup $hostname $kidnum",
    where $kidnum is 0, 1, 2, ... $THREADPOOL_SIZE-1.  Use 'ps -AHf | grep wayb'
    to view them, or more interestingly:
    % perl -e 'while(1) { system("ps -AHf | grep waybackup | grep -v grep"); sleep(1); }'
    (Hit control-C to make it stop.)
 
 
9. THINGS TO DO
 
  * Allow the backing up of sites with more than 10,000 files. This will require updating the Wayback Machine's front-end API.
 
  * Implement the -p and -pr options, which allow the user to specify strings which will be used to filter the pathnames of files to be excluded from or included in the backup.  (This will be pretty easy.)
 
  * Let the user change THREADPOOL_SIZE via command line option.
 
  * Automatically reduce THREADPOOL_SIZE when it is larger than the list of files to be fetched.
 
  * Kill hung child processes (haven't seen any so far, but it seems like a potential failure mode).
 
  * Stop README from appearing in MANIFEST list.
 
  * Explain what "back end interface" and "front end interface" mean in the documentation.
 
  TTK Ciar, 2003-01-07, for the Internet Archive, Data Repository group.
 
 